Target of this project

In this project, we will analyze the white wine data set and try to understand which variables drive the quality of the wine. First we will get a feel for each variable on its own, and then we will look at how the variables correlate with wine quality and with each other.

Cheers

First look

We have the following variables:

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

We have 12 variables. What are their types?

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

All of them are numeric (quality is stored as an integer), and there are 4898 observations.
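The two listings above look like what `names()` and `str()` print. As a minimal sketch of that first look, the snippet below rebuilds a tiny stand-in from the first ten values shown in the `str()` output (five columns only). The object name `wine` is an assumption, since the original object name is not shown.

```r
# Tiny stand-in for the data set, rebuilt from the first ten values that
# str() prints above. `wine` is a hypothetical name for the data frame.
wine <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  chlorides      = c(0.045, 0.049, 0.05, 0.058, 0.058, 0.05, 0.045, 0.045, 0.049, 0.044),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  quality        = rep(6L, 10)
)

names(wine)                                     # variable names
str(wine)                                       # type of each column
column_types <- sapply(wine, class)             # "numeric" or "integer"
all_numeric  <- all(sapply(wine, is.numeric))   # integers count as numeric
```

On the full data the same calls would be run on the complete data frame instead of this ten-row excerpt.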

Let’s get a first numerical overview of the data.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

The summary shows maxima that are really far out for residual.sugar, chlorides, free.sulfur… These might be outliers, reporting problems or really special wines.
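One hedged way to put a number on "really far out" is Tukey's rule of thumb, which flags anything above Q3 + 1.5 × IQR. The sketch below applies it to quartiles and maxima copied from the summary above. The rule is a swapped-in convention for illustration, not necessarily the criterion used in this analysis.

```r
# Quartiles and maxima copied from the summary table above.
stats <- data.frame(
  q1  = c(1.7, 0.036, 23),
  q3  = c(9.9, 0.050, 46),
  max = c(65.8, 0.346, 289),
  row.names = c("residual.sugar", "chlorides", "free.sulfur.dioxide")
)

# Tukey's rule of thumb: values above Q3 + 1.5 * IQR are candidate outliers.
stats$upper_fence      <- stats$q3 + 1.5 * (stats$q3 - stats$q1)
stats$max_beyond_fence <- stats$max > stats$upper_fence
print(stats)
```

All three maxima land beyond their fence, which is consistent with the impression from the summary, but the rule alone cannot tell a recording error from an unusual wine.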


Univariate Plots

Let’s plot the distribution of each variable to get a feel for it first. The shape of each distribution (normal, positively skewed or negatively skewed) and the number of outliers in it will also give some sense of what to expect when plotting variables against each other.

Quality

First of all, what about the quality of these white wines? Since quality is stored as a number, we will also add it as a factor.

It looks like a normal distribution. Most of the wines have a quality of 5, 6 or 7. As the best and the poorest wines are almost like outliers here, it might be difficult to get an accurate model of wine quality. Let’s look at the other plots.
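Adding quality as a factor could look roughly like this. The scores are hypothetical and only illustrate the conversion, and `quality_factor` is an assumed name. The levels 3 to 9 come from the min and max in the summary above.

```r
# Hypothetical scores, only to illustrate the conversion.
quality <- c(6L, 6L, 5L, 7L, 6L, 5L, 8L, 4L, 6L, 7L)

# Ordered factor covering the full observed range (3 to 9 in the summary).
quality_factor <- factor(quality, levels = 3:9, ordered = TRUE)

# Counts per level; levels that do not occur show up as 0.
counts <- table(quality_factor)
print(counts)
```

Declaring all levels up front keeps empty quality levels visible in tables and on plot axes, which matters here because the extreme scores are rare.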


Alcohol

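Let’s check the spread of alcohol.

Much more dispersed, but we have a clear peak around 9.5% by volume. All in all it still looks like a normal distribution, with most of the mass toward the lower values and a longer tail to the right (the mean, 10.51, sits slightly above the median, 10.40).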


Residual Sugar

Residual sugar is the amount of sugar remaining after fermentation.

The x scale is in log10. We have the same type of distribution but with a long tail. The peak is around 1.5 g / dm^3 and the data seem to go all the way up to 25 g / dm^3. Let’s look at the summary.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The max is at 65.8 (which is rather odd) while the 3rd quartile is at 9.9 and the median at 5.2. So either those are quite special wines or there is some error in the data.
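To illustrate why a log10 x scale helps with a long right tail, here is a small sketch on simulated values (a log-normal sample standing in for residual.sugar, not the real data). The `skewness` helper is a simple moment-based estimate written for this example.

```r
set.seed(42)
# Simulated right-skewed values standing in for residual.sugar (assumption).
sugar <- rlnorm(1000, meanlog = log(5), sdlog = 0.8)

# Simple moment-based skewness estimate (positive = long right tail).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(sugar)          # clearly positive on the raw scale
skew_log <- skewness(log10(sugar))   # close to zero on the log10 scale

# Histogram bins on the log10 scale; drop plot = FALSE to draw the plot.
bins <- hist(log10(sugar), breaks = 30, plot = FALSE)
```

The transformation only changes how the values are displayed. The far-out points are still in the data and still need a decision.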


Chlorides

Chlorides measure the amount of salt, which I would expect not to taste great in wine.

We have a normal distribution once the x axis is transformed using a log10 function. Like residual.sugar, the data have a long right tail.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The unit is the same as for residual.sugar but the numbers are far lower. The median is at 0.043 g / dm^3. The max (0.346) is again far out, about 8 times the median, which is not normal.


Citric acid

In small quantities, citric acid can add freshness and flavor to wine.

Looking at citric.acid with a log10 scale, we again have a normal distribution, of course with a few noticeable outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Once again we have a median at 0.32 g/dm^3 whereas the max is at 1.66 and the min is at 0.


pH

The pH scale runs from 0 (very acidic) to 14 (very basic). Most wines are between 3 and 3.5.

The pH is normally distributed with a clear peak, without the need for a log scale. The data are quite dispersed but, as the description says, most of the data points are between 3 and 3.5.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The min and max are once again quite far from the 1st and 3rd quartiles respectively, but overall there is not much variation.


Volatile acidity

Volatile acidity is the amount of acetic acid in wine. At high concentrations it gives an unpleasant vinegar taste which, I think, is what a low quality wine tastes like.

Still a long tail in the data with a peak around 0.3 g / dm^3 of acetic acid.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The max is really out of range of the other data points. The median is at 0.26 g / dm^3 of acetic acid and the mean is quite similar at 0.2782 g / dm^3.


Total Sulfur Dioxide

Total sulfur dioxide is the sum of the free form (which helps prevent oxidation of the wine) and the bound forms of sulfur dioxide. At high concentrations it can influence taste.

The x axis is transformed using log10. We have several bumps but overall a normal distribution. Some data points seem to be isolated. Let’s look at the variable summary.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The max is really high compared to the median and the 3rd quartile. I think it is an error in data reporting or recording.

My main line of inquiry will be the relationship between each variable and quality.
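As a rough sketch of that line of inquiry, one can correlate every variable with quality and rank the results by absolute value. The data below are simulated stand-ins (the effect sizes are made up), and `sim`, `cor_with_quality` and `ranked` are hypothetical names.

```r
set.seed(1)
n <- 200
# Simulated stand-in data; the real analysis would use the wine data frame.
alcohol          <- rnorm(n, mean = 10.5, sd = 1.2)
volatile.acidity <- rnorm(n, mean = 0.28, sd = 0.1)
citric.acid      <- rnorm(n, mean = 0.33, sd = 0.12)
quality <- round(6 + 0.5 * (alcohol - 10.5) -
                 3 * (volatile.acidity - 0.28) + rnorm(n))
sim <- data.frame(alcohol, volatile.acidity, citric.acid, quality)

# Pearson correlation of each predictor with quality.
predictors       <- setdiff(names(sim), "quality")
cor_with_quality <- sapply(sim[predictors], function(x) cor(x, sim$quality))

# Strongest relationships first, regardless of sign.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
print(round(ranked, 2))
```

A ranking like this only captures linear, pairwise relationships, so it is a starting point for choosing which plots to look at rather than a conclusion.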


Bivariate analysis

Alcohol and quality

Does the alcohol level affect the quality?

## [1] 0.4355747

There is a medium correlation (around 0.436) between quality and alcohol: the graph shows that the higher the quality, the higher the alcohol. This is especially true for the higher end wines.
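The coefficient above is presumably Pearson's r, which is what `cor()` computes by default. As a small worked example on hypothetical values (not the wine data), it can be written out by hand and checked against the built-in function:

```r
# Hypothetical values, only to show the computation.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 11.0, 12.2, 12.8, 10.4)
quality <- c(5, 5, 6, 6, 6, 7, 8, 6)

# Pearson's r: co-variation divided by the product of the spreads.
dx <- alcohol - mean(alcohol)
dy <- quality - mean(quality)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_builtin <- cor(alcohol, quality)   # method = "pearson" is the default
```

Because quality only takes a handful of integer values, a rank-based alternative such as `cor(alcohol, quality, method = "spearman")` may also be worth a look.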


Acidity variables

We have 3 types of acidity listed:

  • Fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  • Volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
  • Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

They should all relate to the pH, I think.

## [1] -0.4258583
## [1] -0.03191537
## [1] -0.1637482

So fixed acidity correlates moderately with the pH (around -0.43) and citric acid only weakly (around -0.16), while volatile acidity shows essentially no correlation (around -0.03).
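One hedged way to compare these three coefficients is to square them: r² is the share of the variance in pH that a straight-line fit on the given variable would account for. The values are the ones printed above.

```r
# Coefficients printed above (fixed, volatile, citric acidity vs pH).
r <- c(fixed.acidity    = -0.4258583,
       volatile.acidity = -0.03191537,
       citric.acid      = -0.1637482)

# Share of pH variance captured by a straight-line fit on each variable.
r_squared <- r^2
print(round(100 * r_squared, 1))   # as percentages
```

Seen this way, fixed acidity accounts for a bit under a fifth of the variation in pH, and the other two for very little.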

## `geom_smooth()` using method = 'gam'

Volatile acidity and pH are not related. Even for high quality wines, the pH seems to vary greatly. This confirms the low correlation previously found.


Quality and pH

Let’s take a closer look at pH and quality now.

Figure: White Wine pH by quality

It seems that the higher the quality, the higher the pH, but the trend is not very clear.

## [1] 0.09942725

The correlation coefficient between the two is very low at 0.10, so the first impression from the plot is not really backed by the number.

What about salt (chlorides)?


Quality and chlorides

As a reminder, the chlorides summary.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The summary shows that the max is at 0.346 whereas the 3rd quartile is at 0.05. Let’s zoom in to get a better view.

The plot is limited to the range between the 0.05 and 0.95 quantiles. There is some overlap, but the higher the quality, the lower the chlorides. (This gives some support to our earlier hypothesis that the presence of salt hurts the taste of wine.)
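Limiting a view to the 0.05 to 0.95 quantile range can be done by zooming the axis or by subsetting the data. The sketch below shows the subsetting variant on simulated chloride values. The data and the names `bounds` and `trimmed` are assumptions.

```r
set.seed(7)
# Simulated right-skewed chloride values (not the real data).
chlorides <- rlnorm(500, meanlog = log(0.043), sdlog = 0.4)

# 5% and 95% quantiles, then keep only the values in between.
bounds  <- quantile(chlorides, probs = c(0.05, 0.95))
trimmed <- chlorides[chlorides >= bounds[1] & chlorides <= bounds[2]]

length(chlorides)   # all observations
length(trimmed)     # roughly 90% of them remain
```

Subsetting changes any statistic computed afterwards, whereas zooming an axis only changes the display, so it is worth being explicit about which one a given number is based on.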

## [1] -0.2099344

The correlation is weak at -0.21, so the relationship between the two variables is not very strong.

The last of the variables that obviously affect taste (and so quality), I think, is sugar.


Quality and residual sugar

No clear relationship between sugar and quality. The scatterplot shows points all over the place. As for the boxplot, the median seems to be higher for mid-range qualities, but there is nothing special to notice here.

## [1] -0.09757683

The correlation confirms that the relationship is very weak (-0.10), almost negligible.
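The medians read off a boxplot can also be computed directly per quality level. A minimal sketch with `tapply()` on simulated stand-in data (sugar is generated independently of quality here, so no trend is expected):

```r
set.seed(3)
# Simulated stand-in data, not the wine data.
quality        <- sample(4:8, 300, replace = TRUE)
residual.sugar <- rlnorm(300, meanlog = log(5), sdlog = 0.8)

# Median residual sugar within each quality level.
median_by_quality <- tapply(residual.sugar, quality, median)
print(round(median_by_quality, 2))
```

Printing the group sizes with `table(quality)` next to the medians helps judge how much weight each median deserves.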


Multivariate analysis

A closer look at chlorides.

Chlorides and alcohol by quality

## `geom_smooth()` using method = 'gam'

The overall trend is downward: fewer chlorides as the alcohol level gets higher. But we can see that for the top qualities (6, 7, 8), there is a major concentration of points around 10 to 12, where the salt level rises a bit. It might be due to a few outliers, as we only have a few points for those qualities.

## [1] -0.2231098
## [1] -0.3199424
## [1] -0.5545504
## [1] -0.5124824

The correlations between chlorides and alcohol are:

  • -0.223 for quality 5
  • -0.320 for quality 6
  • -0.555 for quality 7
  • -0.512 for quality 8

Correlations are weak for qualities 5 and 6, but quite noticeable for qualities 7 and 8.
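Per-quality correlations like the ones listed above can be obtained by splitting the data by quality and applying `cor()` to each piece. A minimal sketch on simulated data with a built-in negative link between chlorides and alcohol (the data and the name `sim` are assumptions):

```r
set.seed(11)
n <- 400
# Simulated stand-in data with a built-in negative link (not the wine data).
sim <- data.frame(quality = sample(5:8, n, replace = TRUE),
                  alcohol = rnorm(n, mean = 10.5, sd = 1.2))
sim$chlorides <- 0.045 - 0.004 * (sim$alcohol - 10.5) + rnorm(n, sd = 0.01)

# One data frame per quality level, then one correlation per piece.
cor_by_quality <- sapply(split(sim, sim$quality),
                         function(d) cor(d$chlorides, d$alcohol))

# Group sizes: correlations from small groups are less reliable.
group_sizes <- table(sim$quality)
print(round(cor_by_quality, 3))
print(group_sizes)
```

Reporting the group sizes alongside the coefficients makes it easier to tell a genuinely stronger relationship from one driven by a handful of points.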


Residual sugar and alcohol by quality

Let’s look at the same thing with sugar:

## `geom_smooth()` using method = 'gam'

Overall there is a downward trend here. Let’s confirm this with correlations.

## [1] -0.4414825
## [1] -0.4549961
## [1] -0.4809369
## [1] -0.5220108

The correlations are moderate:

  • -0.441 for quality 5
  • -0.455 for quality 6
  • -0.481 for quality 7
  • -0.522 for quality 8

The correlations are fairly consistent across qualities, with quality 8 on the higher side, but we only have a few data points for this quality so the difference might be due to that.

The trends in the plots for chlorides and residual.sugar look much the same, and both have negative correlations with alcohol. Chlorides and residual sugar might be linked.


Chlorides and residual sugar by quality

## `geom_smooth()` using method = 'gam'

The relationship between chlorides and residual sugar seems to get more wiggly (non-linear) as the quality improves. It might make white wine taste unpleasant and unpredictable in the end.

Let’s see if the ratio of sugar to chlorides is any help:

No luck here. I was hoping for some clusters of points for each quality level, but they vary widely.
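The ratio is referred to later as residual.sugar_chlorides. A minimal sketch of how such a variable could be built, using the first ten values of both columns from the `str()` output near the top, and assuming the ratio is residual sugar divided by chlorides (the exact definition is not shown):

```r
# First ten values of each column, as printed by str() earlier.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
chlorides      <- c(0.045, 0.049, 0.05, 0.058, 0.058, 0.05, 0.045, 0.045, 0.049, 0.044)

# Assumed definition: grams of sugar per gram of chlorides.
residual.sugar_chlorides <- residual.sugar / chlorides
print(round(residual.sugar_chlorides, 1))
```

Because chlorides are small numbers, the ratio spans a wide range even within these ten rows, which may be part of why no clusters show up.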

## [1] 0.02252779
## [1] 0.03512032
## [1] 0.2755308
## [1] 0.3148669

The correlations for qualities 5 and 6 are very low (0.02 and 0.04) while those for 7 and 8 are noticeably higher (0.28 and 0.31). This might be because we have fewer points for the higher qualities than for the other quality levels. Furthermore, we have very far-off values in both residual.sugar and chlorides, so let’s calculate the same values on a subset limited to the range between the .05 and .95 quantiles.

## [1] 0.1124103
## [1] 0.1979728
## [1] 0.3488694
## [1] 0.417167

Correlations increase slightly once the extreme quantiles are removed:

  • 0.112 for quality 5
  • 0.198 for quality 6
  • 0.349 for quality 7
  • 0.417 for quality 8

I think this relationship between residual sugar and chlorides is worth investigating to see if other variables might come into play. So let’s map chlorides against sugar, add alcohol as a color, and use this as a base for other graphs.


Chlorides, residual sugar and alcohol by quality

We can see as before that the lower the chloride content, the higher the alcohol. Furthermore, low-alcohol wines seem to have less sugar and a high chloride content.

## [1] -0.2996015
## [1] -0.3502557
## [1] -0.3309418
## [1] -0.2957432
## [1] -0.2876385

The correlation between residual.sugar_chlorides and alcohol for the subset is -0.300, which is low. Breaking it up by quality we have:

  • -0.350 for quality 5
  • -0.331 for quality 6
  • -0.296 for quality 7
  • -0.288 for quality 8

So the correlation does not vary greatly between qualities: no quality level shows a specific relationship.


Acidity variables and quality

Let’s go back to our acidity variables:

These box plots don’t help much in the exploration here. Let’s check the correlations.

## [1] -0.1136628
## [1] -0.194723
## [1] -0.009209091

The correlations show that citric acid is practically uncorrelated with quality (-0.009). On the other hand, volatile acidity (-0.19) and fixed acidity (-0.11) show some relationship, especially volatile acidity. This kind of acidity correlates negatively with quality, meaning that as the quality improves, the volatile acidity decreases. The vinegar taste brought by volatile acidity really hurts the quality.

So we now have alcohol, volatile acidity and fixed acidity as the variables most related to quality, although only alcohol reaches a moderate correlation. I will get back to the previous faceted chart of residual sugar and chlorides and add citric acid/volatile acidity to see if I can get some more information.


Chlorides, residual sugar and acidities by quality

## [1] 0.08184145

We can see that as the volatile acidity level increases, the quality seems to go down a bit (-0.19), showing a slight negative correlation there.

Citric acid and white wine quality are barely related. The trend seems to flatten out across all levels.

## [1] 0.09948825
## [1] 0.2050855
## [1] 0.02915079
## [1] 0.0073993
## [1] 0.0828936

Correlations show that there is in fact nothing. The correlation for our subset is 0.10. Breaking it down by quality:

  • 0.21 for quality 5
  • 0.03 for quality 6
  • 0.007 for quality 7
  • 0.08 for quality 8

Chlorides, residual sugar and density by quality

Density is related to alcohol, so let’s see if we can find something here.

There is some trend going on: better wine quality goes with higher residual.sugar and higher densities.

## [1] 0.672887
## [1] -0.2996015
## [1] 0.7901922
## [1] -0.3502557

Density relates more strongly to the residual.sugar_chlorides ratio than alcohol does: 0.67 vs -0.30. We see the same phenomenon for quality 5: 0.79 for density vs -0.35 for alcohol.


Chlorides, residual sugar and sulphates by quality

It seems that the more residual sugar and chlorides in the wine, the more sulphates we get.

## [1] 0.09948825
## [1] 0.2050855
## [1] 0.02915079
## [1] 0.0073993
## [1] 0.0828936

Correlations show that there is in fact nothing. The correlation for our subset is 0.10. Breaking it down by quality:

  • 0.21 for quality 5
  • 0.03 for quality 6
  • 0.007 for quality 7
  • 0.08 for quality 8

Chlorides, residual sugar and density by quality

Density is related to alcohol, so let’s see if we can find something here.

We get the same kind of graph across all quality levels, with the less dense wines on the left side and the denser wines on the right.

## [1] 0.672887
## [1] -0.2996015
## [1] 0.6908107
## [1] -0.3309418

Density relates more strongly to the residual.sugar_chlorides ratio than alcohol does: 0.67 vs -0.30. We see the same phenomenon for quality 6 (0.69 for density vs -0.33 for alcohol).


Chlorides, residual sugar and sulphates by quality

There seems to be no clear picture of how sulphates relate to quality, nor to chlorides and residual.sugar.

## [1] -0.0753957
## [1] 0.02227755
## [1] -0.07837732
## [1] -0.1402696
## [1] -0.2718309

The correlation confirms this impression, at -0.08 for the subset between residual.sugar_chlorides and sulphates. Breaking it down by quality:

  • 0.02 for quality 5
  • -0.08 for quality 6
  • -0.14 for quality 7
  • -0.27 for quality 8

Correlations by quality are low; the magnitude increases for quality 8, but this quality level has only a few data points.

No strong correlation here.

## [1] 0.0521672

The correlation confirms this (0.05): there is no strong relationship between quality and sulphates.

Let’s now look at sulfurs.


Chlorides, residual sugar and sulfurs by quality

Both plots show sulfur levels spread across all qualities and all levels of residual.sugar and chlorides. There are slight positive relationships between those variables.

## [1] 0.2728557
## [1] 0.3028951

The correlations are indeed slightly positive:

  • 0.27 between residual.sugar_chlorides and free.sulfur.dioxide
  • 0.30 between residual.sugar_chlorides and total.sulfur.dioxide

First of all, there are outliers in the dataset. I tried to remove them when their values seemed to be really out of range, but they might also reflect the diversity of wines. As I am not sure, the following graphs are based on the whole dataset. All in all, there are no really strong correlations between quality and the other variables presented here, but I could make out some relationships regarding taste:

Vinegar taste is not really looked for

volatile.acidity = Vinegar taste, unpleasant actually

The densities were scaled so that the low counts of the extreme qualities do not affect the overall distribution. As the quality of the wines gets higher, the volatile acidity gets lower. The correlation is not so strong at -0.19. The negative sign means that the higher the volatile acidity (and hence the vinegar taste), the lower the quality, although this is not always true. This is roughly what we would expect: cheap wine (which is not always, but most of the time, of lower quality) usually has a pungent smell and taste, like vinegar.

Citric acid has little effect on white wine quality

citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

Citric acid does not play a significant role in enhancing quality here.

High alcohol and low sulphates make for better white wines

This graph shows smoothers of alcohol by sulphates for all qualities. Quality and alcohol have a moderate correlation (0.44).

The low quality (3) wines vary a lot but the overall trend is that they are low in alcohol, and the alcohol level decreases as the sulphates level increases. Qualities 4 to 7 are flat overall and the only differences come from the alcohol level. Quality 7 and 8 wines are peculiar in that they are grouped in the top left corner: the relationship is not flat like for the other qualities (except 3), and those wines have high alcohol and fewer sulphates.

Look for lower chloride content for better wine

chlorides: the amount of salt in the wine


Final Plots and Summary

In the above analysis I tried to document my problems and solutions. I’ll repeat three of the above plots and give my reasons why they are essential.

I include this plot because it was our first strong evidence about alcohol and quality. We can see as before that the lower the chloride content, the higher the alcohol. Furthermore, low-alcohol wines seem to have less sugar and a high chloride content.

This plot spurred my interest in exploring the relationship of chlorides and acids.

Citric acid and white wine quality are barely related. Later in the analysis we confirmed with a boxplot of citric acid that it does not play a significant role in enhancing white wine quality.

The overall downward trend here tells us that less salt goes with better quality, although the correlation itself is weak (-0.21).

Reflection

In this project my main focus was to do only exploratory analysis.

  1. I started the analysis by looking at the data overall. Some data points seemed really far out and during the analysis I removed some of them, but I decided to keep them in the end as wine tastes can vary greatly.
    Looking at the different variables, I decided to investigate what makes a wine good.
  2. I started by looking at the main things that could affect taste: alcohol, chlorides, residual.sugar and pH. My first finding was really about alcohol and quality.
  3. I then moved on to multivariate analyses by combining alcohol with chlorides and residual.sugar. The overall trends by quality looked quite similar, so I tried plotting chlorides against residual.sugar, but the results were not what I was expecting. Thinking that another factor might be related to those two, I went through the acidity variables and the quality to select volatile.acidity and citric.acid and plot them with chlorides and residual.sugar.

I did the same exercise for the rest of the variables. I was looking for some more advanced relationships between the variables but was not able to find them. All in all, I end up with several variables correlating with quality:

  1. Alcohol
  2. Volatile acidity
  3. Citric acid
  4. Sulphates

The limits of the dataset are really the lack of points for the lower and higher quality wines. Furthermore, the source of the quality score is unknown: is it from a professional? A store? And we have to keep in mind that taste is a really cultural thing (see the Indian food analysis) and a good wine for someone might not be good for another.

Last but not least, we do not have the age of the wine. In France, one of the first things we check for a wine is its age, because common knowledge dictates that an older wine is better. Another thing I would like is to get the names of the wines so that I can get their prices and see how they relate to quality.